prompt style
Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation
Robust workflow composition is critical for effective agent performance, yet progress in Large Language Model (LLM) planning and reasoning is hindered by a scarcity of scalable evaluation data. This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation, translating them into both natural language and formal PDDL. I evaluate several open-source, instruct-tuned LLMs on a dataset of 2296 low-difficulty problems generated by NL2Flow. Results demonstrate that the best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Importantly, translating natural language problems into a structured JSON representation prior to symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. These findings underscore the importance of understanding error sources within LLM reasoning as systems scale to more complex tasks. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
How Model Size, Temperature, and Prompt Style Affect LLM-Human Assessment Score Alignment
Jung, Julie, Lu, Max, Benker, Sina Chole, Darici, Dogus
We examined how model size, temperature, and prompt style affect Large Language Models' (LLMs) alignment within itself, between models, and with human in assessing clinical reasoning skills. Model size emerged as a key factor in LLM-human score alignment. Study highlights the importance of checking alignments across multiple levels.
- North America > United States > Colorado > Denver County > Denver (0.04)
- Europe > Germany (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine (1.00)
- Education > Assessment & Standards (0.69)
Evaluating Generalization and Representation Stability in Small LMs via Prompting, Fine-Tuning and Out-of-Distribution Prompts
We investigate the generalization capabilities of small language models under two popular adaptation paradigms: few-shot prompting and supervised fine-tuning. While prompting is often favored for its parameter efficiency and flexibility, it remains unclear how robust this approach is in low-resource settings and under distributional shifts. This paper presents a comparative study of prompting and fine-tuning across task formats, prompt styles, and model scales, with a focus on their behavior in both in-distribution and out-of-distribution (OOD) settings. Beyond accuracy, we analyze the internal representations learned by each approach to assess the stability and abstraction of task-specific features. Our findings highlight critical differences in how small models internalize and generalize knowledge under different adaptation strategies. This work offers practical guidance for model selection in low-data regimes and contributes empirical insight into the ongoing debate over prompting versus fine-tuning. Code for the experiments is available at the following
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada (0.04)
An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint
Sun, Yi, Wang, Han, Li, Jiaqiang, Liu, Jiacheng, Li, Xiangyu, Wen, Hao, Yuan, Yizhen, Zheng, Huiwen, Liang, Yan, Li, Yuanchun, Liu, Yunxin
Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making models think before answering, they are able to achieve much higher accuracy with extra inference computation. However, in many real-world scenarios, models are used under time constraints, where an answer should be given within a certain output length. It is unclear whether and how the reasoning ability of different LLMs remain effective under strict constraints. We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test 30 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between the inference accuracy and various properties including model type, model size, prompt style, etc. We also consider the mappings between token budgets and actual on-device latency budgets. The results have demonstrated several interesting findings regarding the budget-aware LLM reasoning ability that differ from the unconstrained situation, e.g. the optimal choices of either model size or prompt style change under different budgets. These findings offer timely evaluation to this area and practical guidance for users to deploy LLMs under real-world latency constraints.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Middle East > Jordan (0.04)
How Do Generative Models Draw a Software Engineer? A Case Study on Stable Diffusion Bias
Fadahunsi, Tosin, d'Aloisio, Giordano, Di Marco, Antinisca, Sarro, Federica
Generative models are nowadays widely used to generate graphical content used for multiple purposes, e.g. web, art, advertisement. However, it has been shown that the images generated by these models could reinforce societal biases already existing in specific contexts. In this paper, we focus on understanding if this is the case when one generates images related to various software engineering tasks. In fact, the Software Engineering (SE) community is not immune from gender and ethnicity disparities, which could be amplified by the use of these models. Hence, if used without consciousness, artificially generated images could reinforce these biases in the SE domain. Specifically, we perform an extensive empirical evaluation of the gender and ethnicity bias exposed by three versions of the Stable Diffusion (SD) model (a very popular open-source text-to-image model) - SD 2, SD XL, and SD 3 - towards SE tasks. We obtain 6,720 images by feeding each model with two sets of prompts describing different software-related tasks: one set includes the Software Engineer keyword, and one set does not include any specification of the person performing the task. Next, we evaluate the gender and ethnicity disparities in the generated images. Results show how all models are significantly biased towards male figures when representing software engineers. On the contrary, while SD 2 and SD XL are strongly biased towards White figures, SD 3 is slightly more biased towards Asian figures. Nevertheless, all models significantly under-represent Black and Arab figures, regardless of the prompt style used. The results of our analysis highlight severe concerns about adopting those models to generate content for SE tasks and open the field for future research on bias mitigation in this context.
- Europe > Switzerland (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Italy > Abruzzo > L'Aquila Province > L'Aquila (0.04)
- (2 more...)
Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles
Xia, Yuxi, de Araujo, Pedro Henrique Luz, Zaporojets, Klim, Roth, Benjamin
Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate if incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal loss improve calibration from baselines. We find that few-shot prompts are the most effective for auxiliary model-based methods, and auxiliary models demonstrate robust calibration performance across accuracy variations, outperforming LLMs' internal probabilities and verbalized confidences. These insights deepen the understanding of influence factors in LLM calibration, supporting their reliable deployment in diverse applications.
Exploring LLM-Driven Explanations for Quantum Algorithms
d'Aloisio, Giordano, Fortz, Sophie, Hanna, Carol, Fortunato, Daniel, Bensoussan, Avner, Usandizaga, Eñaut Mendiluze, Sarro, Federica
Background: Quantum computing is a rapidly growing new programming paradigm that brings significant changes to the design and implementation of algorithms. Understanding quantum algorithms requires knowledge of physics and mathematics, which can be challenging for software developers. Aims: In this work, we provide a first analysis of how LLMs can support developers' understanding of quantum code. Method: We empirically analyse and compare the quality of explanations provided by three widely adopted LLMs (Gpt3.5, Llama2, and Tinyllama) using two different human-written prompt styles for seven state-of-the-art quantum algorithms. We also analyse how consistent LLM explanations are over multiple rounds and how LLMs can improve existing descriptions of quantum algorithms. Results: Llama2 provides the highest quality explanations from scratch, while Gpt3.5 emerged as the LLM best suited to improve existing explanations. In addition, we show that adding a small amount of context to the prompt significantly improves the quality of explanations. Finally, we observe how explanations are qualitatively and syntactically consistent over multiple rounds. Conclusions: This work highlights promising results, and opens challenges for future research in the field of LLMs for quantum code explanation. Future work includes refining the methods through prompt optimisation and parsing of quantum code explanations, as well as carrying out a systematic assessment of the quality of explanations.
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.05)
- Europe > United Kingdom > England > Greater London > London (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.48)
How Does Diverse Interpretability of Textual Prompts Impact Medical Vision-Language Zero-Shot Tasks?
Wang, Sicheng, Liu, Che, Arcucci, Rossella
Recent advancements in medical vision-language pre-training (MedVLP) have significantly enhanced zero-shot medical vision tasks such as image classification by leveraging large-scale medical image-text pair pre-training. However, the performance of these tasks can be heavily influenced by the variability in textual prompts describing the categories, necessitating robustness in MedVLP models to diverse prompt styles. Yet, this sensitivity remains underexplored. In this work, we are the first to systematically assess the sensitivity of three widely-used MedVLP methods to a variety of prompts across 15 different diseases. To achieve this, we designed six unique prompt styles to mirror real clinical scenarios, which were subsequently ranked by interpretability. Our findings indicate that all MedVLP models evaluated show unstable performance across different prompt styles, suggesting a lack of robustness. Additionally, the models' performance varied with increasing prompt interpretability, revealing difficulties in comprehending complex medical concepts. This study underscores the need for further development in MedVLP methodologies to enhance their robustness to diverse zero-shot prompts.
- Europe > United Kingdom (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- North America > United States (0.04)
- Europe > Switzerland (0.04)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Investigating Context Effects in Similarity Judgements in Large Language Models
Uprety, Sagar, Jaiswal, Amit Kumar, Liu, Haiming, Song, Dawei
Large Language Models (LLMs) have revolutionised the capability of AI models in comprehending and generating natural language text. They are increasingly being used to empower and deploy agents in real-world scenarios, which make decisions and take actions based on their understanding of the context. Therefore researchers, policy makers and enterprises alike are working towards ensuring that the decisions made by these agents align with human values and user expectations. That being said, human values and decisions are not always straightforward to measure and are subject to different cognitive biases. There is a vast section of literature in Behavioural Science which studies biases in human judgements. In this work we report an ongoing investigation on alignment of LLMs with human judgements affected by order bias. Specifically, we focus on a famous human study which showed evidence of order effects in similarity judgements, and replicate it with various popular LLMs. We report the different settings where LLMs exhibit human-like order effect bias and discuss the implications of these findings to inform the design and development of LLM based applications.
- North America > United States > District of Columbia > Washington (0.05)
- Asia > China (0.05)
- Asia > North Korea (0.05)
- (6 more...)
Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
Sinha, Neelabh, Jain, Vinija, Chadha, Aman
The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of outputs of 10 smaller, open LMs across three aspects: task types, application domains and reasoning types, using diverse prompt styles. We demonstrate that most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects for their strategic selection based on use-case and other constraints. We also show that if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Singapore (0.04)
- North America > Dominican Republic (0.04)
- (5 more...)
- Banking & Finance (0.46)
- Information Technology > Security & Privacy (0.45)